Add NFD image compatibility scheduler proposal. #2403
base: master
Conversation
✅ Deploy Preview for kubernetes-sigs-nfd ready!
Hi @Xunli-Yang. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Pull request overview
This PR adds KEP-2403, which proposes a compatibility scheduler plugin for Node Feature Discovery (NFD). Building on KEP-1845 (which established node compatibility validation), this proposal introduces automated scheduling capabilities to ensure pods are scheduled on nodes compatible with their container image requirements.
Key changes:
- Introduces three alternative solution designs for implementing image compatibility scheduling
- Proposes an `ImageCompatibilityPlugin` that leverages `NodeFeatureGroup` CRs to filter compatible nodes
- Presents performance tradeoffs from basic validation (Solution 1) to optimized large-scale approaches (Solutions 2 and 3)
Reviewed changes
Copilot reviewed 1 out of 4 changed files in this pull request and generated 26 comments.
| File | Description |
|---|---|
| enhancements/2403-nfd-image-compatibility-scheduler/README.md | Complete KEP document proposing three solutions for image compatibility scheduling with detailed workflows, merits/demerits analysis, and test plans |
| enhancements/2403-nfd-image-compatibility-scheduler/solution1.png | Architectural diagram illustrating the basic NodeFeatureGroup check approach |
| enhancements/2403-nfd-image-compatibility-scheduler/solution2.png | Architectural diagram showing the SQLite database caching solution |
| enhancements/2403-nfd-image-compatibility-scheduler/solution3.png | Architectural diagram depicting the node pre-grouping optimization strategy |
1. **CR Creation and Update (Prefilter Phase):** When a pod with specific image requirements enters the scheduling queue, the scheduler plugin fetches the attached OCI artifact. It extracts the compatibility metadata (e.g., required kernel features) and **instantly creates a new `NodeFeatureGroup` CR**. This CR's specification defines the dynamic compatibility rules.

   The `update NodeFeatureGroup` operation evaluates **all nodes in the cluster** against the CR's specification rules and updates the CR's `status` field with the list of nodes that satisfy the compatibility demands.
   ```yaml
   apiVersion: nfd.k8s-sigs.io/v1alpha1
   kind: NodeFeatureGroup
   metadata:
     name: node-feature-group-example
   spec:
     featureGroupRules:
       - name: "kernel version"
         matchFeatures:
           - feature: kernel.version
             matchExpressions:
               major: {op: In, value: ["6"]}
   status:
     nodes:
       - name: node-1
       - name: node-2
       - name: node-3
   ```
2. **Node Filtering (Filter Phase):** In the scheduler's Filter phase, the plugin retrieves the dynamically created `NodeFeatureGroup` CR and filters the candidate nodes, ensuring that only nodes listed in the CR's `status` are considered compatible.
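The Filter-phase check described above amounts to membership filtering against the CR's `status.nodes` list. A minimal, self-contained sketch of that core step (illustrative type names only; the real CRD types and scheduler-framework wiring are not shown here):

```go
package main

import "fmt"

// nodeFeatureGroupStatus mirrors the status of a NodeFeatureGroup CR:
// the list of node names that satisfied the group's feature rules.
// (Hypothetical type for illustration; the real types live in the NFD API.)
type nodeFeatureGroupStatus struct {
	Nodes []string
}

// filterCompatibleNodes keeps only the candidate nodes that appear in the
// NodeFeatureGroup status, as the proposed Filter phase would.
func filterCompatibleNodes(candidates []string, status nodeFeatureGroupStatus) []string {
	compatible := make(map[string]bool, len(status.Nodes))
	for _, n := range status.Nodes {
		compatible[n] = true
	}
	var out []string
	for _, c := range candidates {
		if compatible[c] {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	status := nodeFeatureGroupStatus{Nodes: []string{"node-1", "node-2", "node-3"}}
	candidates := []string{"node-2", "node-4", "node-1"}
	fmt.Println(filterCompatibleNodes(candidates, status)) // prints [node-2 node-1]
}
```

In a real plugin this lookup would back the framework's `Filter` callback; the map makes each per-node check O(1) rather than scanning the status list per candidate.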
**Copilot (AI)** commented on Dec 27, 2025:
Missing critical information about lifecycle management. The proposal mentions creating NodeFeatureGroup CRs dynamically during scheduling but doesn't address cleanup. When and how are these ephemeral CRs deleted? Without proper cleanup, they could accumulate and cause resource exhaustion. This is particularly important for Solution 1 and potentially Solution 3, which create CRs per scheduling request.
The process involves three main phases:
1. **Initial Cluster Grouping:** In the cluster preparation stage, the administrator divides the cluster nodes into several groups via `NodeFeatureGroup`. Multiple `NodeFeatureGroup` Custom Resources (CRs) are created declaratively, each defining a grouping rule. Their status is populated with all matching nodes, completing the pre-grouping setup.
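As an illustration, the declarative pre-grouping could look like the following two CRs (the group names and rules here are hypothetical examples, following the `NodeFeatureGroup` schema shown earlier):

```yaml
# Hypothetical pre-group: nodes running a 6.x kernel
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureGroup
metadata:
  name: group-kernel-6
spec:
  featureGroupRules:
    - name: "kernel 6.x"
      matchFeatures:
        - feature: kernel.version
          matchExpressions:
            major: {op: In, value: ["6"]}
---
# Hypothetical pre-group: nodes with AVX-512 support
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureGroup
metadata:
  name: group-cpu-avx512
spec:
  featureGroupRules:
    - name: "avx512"
      matchFeatures:
        - feature: cpu.cpuid
          matchExpressions:
            AVX512F: {op: Exists}
```

NFD would then populate each CR's `status.nodes` with the matching nodes, so the scheduler only needs to evaluate one representative node per group.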
**Copilot (AI)** commented on Dec 27, 2025:
Missing important implementation detail. The proposal mentions that "administrator should divide the cluster nodes into several groups by NodeFeatureGroup" but doesn't provide guidance on how to determine appropriate grouping rules or how many groups are optimal. Additionally, it doesn't address what happens when new nodes are added to the cluster - how are they assigned to groups? These are critical considerations for the practical implementation of this solution.
[APPROVALNOTIFIER] This PR is **NOT APPROVED**. This pull-request has been approved by: Xunli-Yang. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Co-authored-by: Joe Huang <[email protected]>
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
**ArangoGutierrez** left a comment:
Missing sections:
- Risks and Mitigations
- Graduation Criteria
- Implementation Timeline
**ArangoGutierrez** left a comment:
Thanks @Xunli-Yang and @ChaoyiHuang for this comprehensive proposal! This is exactly the kind of Phase 2 work we need to make NFD image compatibility production-ready.
/ok-to-test
Feedback
Preferred Direction: Solution 3 (Node Pre-Grouping)
I'm leaning toward Solution 3 for the following reasons:

- **Aligns with real-world cluster management** - Large-scale operators already organize nodes into pools/groups based on hardware characteristics. This solution leverages existing practices rather than fighting against them.
- **O(G) vs O(N) is critical at scale** - Evaluating 10 representative nodes vs 10,000 individual nodes is the difference between sub-millisecond and multi-second scheduling latency.
- **Simpler architecture** - Unlike Solution 2 (SQLite), this doesn't require significant infrastructure changes to NFD master. The complexity is in the grouping strategy, not new storage backends.
- **Progressive path** - We could start with Solution 1 as an MVP for small clusters, then add Solution 3 optimizations for scale. Solution 2 with SQL isn't necessarily over-engineered, but introducing a SQL database into NFD feels like a project on its own and out of scope for this proposal.
Questions

1. **Scheduler plugin location:** Have you considered building this as part of kubernetes-sigs/scheduler-plugins? That's the standard home for custom scheduler plugins and would give us the scheduling framework integration for free.
2. **NFG lifecycle management:** What's the cleanup strategy for ephemeral `NodeFeatureGroup` CRs created during scheduling? Do they persist for caching, or are they garbage collected?
3. **Group homogeneity enforcement:** For Solution 3, how do we validate/enforce that nodes within a pre-group are actually homogeneous? What happens if a node's features drift?
4. **Failure modes:** What happens if:
   - The OCI artifact fetch fails during Prefilter?
   - The NFG status is stale or the controller is slow to update?
   - No groups match the compatibility requirements?
Missing KEP Sections
To align with standard KEP format, could you add:
- Risks and Mitigations
- Graduation Criteria (Alpha → Beta → GA)
- Implementation Timeline / Milestones
- Alternatives Considered (e.g., why not use node affinity directly?)
Great work on the diagrams - they really help visualize the three approaches. Looking forward to discussing this in the next community meeting!
Thanks @ArangoGutierrez, very valuable views for us. Agreed - Solution 3 (Node Pre-Grouping) is also what we'd like to recommend, as is the progressive path; we are working on a demo for Solution 1 (as the base of Solution 3). As you said, starting with an MVP for small clusters can be the target of the first stage.

Q&A
1. Yes, our target is to integrate it into kubernetes-sigs/scheduler-plugins as a common scheduling plugin. We expect to incubate it initially in the NFD SIG.
2. Manually created CRs are long-lived, but the temporary CRs created by the scheduler will be garbage-collected with a TTL. All these details will be added to the proposal.
3. The idea is that administrators are responsible for ensuring homogeneity when they define the pre-groups; this is mandatory for cluster administrators and depends on the grouping strategy. If a node drifts later, the pre-groups will be updated through the `NodeFeatureGroup` update, so there is no influence on future scheduling. Only when drift happens after the scheduling process are already-scheduled pods unaffected until the next scheduling cycle - we'll need to add a monitoring mechanism to watch for drifted nodes, alert, and have administrators trigger rescheduling.
4. Good question. We'll add the details of the solution along with the missing KEP sections.
Add KEP: NFD image compatibility scheduler proposal.
What does the proposal do?
Building upon the first phase of the KEP-1845 proposal, which completed node compatibility validation, this proposal introduces a compatibility scheduling plugin. The plugin automatically analyzes the compatibility requirements of container images, filters suitable nodes for scheduling, and ensures that containers run on compatible nodes.
Special notes for reviewers:
Based on the discussions in the node-feature-discovery Slack channel, this proposal presents three solutions and intends to reach consensus on the implementation direction.
Co-authored-by: @ChaoyiHuang